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Abstract. In this paper, we describe centering and noncentering method- 
ology as complementary techniques for use in parametrization of broad 
classes of hierarchical models, with a view to the construction of effec- 
tive MCMC algorithms for exploring posterior distributions from these 
models. We give a clear qualitative understanding as to when center- 
ing and noncentering work well, and introduce theory concerning the 
convergence time complexity of Gibbs samplers using centered and non- 
centered parametrizations. We give general recipes for the construction 
of noncentered parametrizations, including an auxiliary variable tech- 
nique called the state-space expansion technique. We also describe par- 
tially noncentered methods, and demonstrate their use in constructing 
robust Gibbs sampler algorithms whose convergence properties are not 
overly sensitive to the data. 

Key words and phrases: Parametrization, hierarchical models, latent 
stochastic processes, MCMC. 



1. INTRODUCTION 

It has long been recognized that the parametriza- 
tion of a hierarchical model can be crucial for the 
performance of the Gibbs sampler or the Metropolis- 
Hastings algorithm which is used to infer about it, 
and interesting research has been done in that direc- 
tion. Parametrization methodology is well developed 
for Gaussian and generalized linear mixed models. 
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Research in this area has benefited from the avail- 
ability of analytic convergence rates for the Gibbs 
sampler on Gaussian target distributions, provided 
by [37]. However, the development of general para- 
metrization strategies for nonlinear hierarchical mod- 
els and in particular for models which involve unob- 
served (latent) stochastic processes has not received 
analogous attention; for example, it is listed as one 
of the important directions for future research in 
the centennial review by Gelfand [12]. While it is 
clearly natural and interpretable to specify models 
according to a hierarchical formulation, more gen- 
eral parametrizations are required for the design of 
appropriate Markov chain Monte Carlo (MCMC) al- 
gorithms for posterior investigation. 

Nevertheless, increasingly complex hierarchical 
models are being used in several areas of applied 
statistics, including econometrics, geostatistics and 
genetics, raising high computational challenges. Much 
research has been devoted to designing high-tech 
MCMC algorithms that aim to optimize performance 
within a particular class of models. Such strategies 
are often ad hoc and not always applicable in a wider 
framework. Moreover, the bag of MCMC tools and 
tricks has over the years grown to such an extent 
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that its full power is largely available only to spe- 
cialists. 

Hence, our aim in this paper has been to focus 
on a general strategy that applies to a wide range 
of statistical contexts. To allow this generality, we 
may sometimes have to sacrifice the quest for opti- 
mality for that of robustness. Our proposed strategy, 
when employing MCMC to perform inference for a 
given hierarchical model, is to use the Gibbs sampler 
or a similar component- wise updating Metropolis- 
Hastings algorithm, and to consider two diff^erent 
parametrizations of the model, which we term the 
centered and the noncentered parametrizations. The 
former is usually the default option, and has been 
used extensively in the literature. The latter, as we 
argue, turns out to be its natural complement, in the 
sense that the Gibbs sampler, when under the one 
parametrization, converges slowly; under the other 
it often converges much faster. This complementary 
role of the two parametrizations constitutes the first 
main advantage of the centered/noncentered frame- 
work. The second advantage is the ability to iden- 
tify, before running any computer program, which 
of the parametrizations is preferable just by look- 
ing at the model structure. In this paper, following 
a tutorial style we show how certain model struc- 
tures interact with the efficiency of the Gibbs sam- 
pler which is used to infer about them, under both 
parametrizations. We classify these model structures 
into three categories, and we demonstrate our argu- 
ments on a series of test models, which correspond 
to popular models used in econometrics, geostatis- 
tics and classification. Our test models are presented 
in their most basic form, since our aim is to present 
the general methods and issues rather than to solve 
specific problems. However, detailed references are 
provided, to direct the interested reader to papers 
which have dealt in detail with MCMC computation 
for the more elaborate versions of the models. 

2. AUGMENTATION SCHEMES AND 
PARAMETRIZATIONS 



where /i is the measure with respect to which the 
density P{X \ Q) is defined. Obviously, there are un- 
limited choices of X which lead to the same posterior 
inference P(0 | Y). 

In this paper we are concerned with the problem 
where MCMC methods are employed to get samples 
from the joint posterior P{X, O | y). 

There are several reasons for introducing such an 
X: (i) the observed likelihood P{Y \ 0) might be an- 
alytically unavailable or too complicated to make 
inference about O, (ii) X itself might have some sci- 
entific interpretation and be of statistical interest, 
and (iii) MCMC methods might be much more effi- 
ciently applied to P{X, G | Y) than to P(G | Y). 

We define a (re) parametrization of an augmenta- 
tion scheme X, by any random pair {X*,Q) with 
joint prior density P{X*,Q) (with respect to an ar- 
bitrary dominating measure) together with a func- 
tion h such that 

(2.1) X = h{X*,e,Y). 

In this general setup, h need not be one-to-one. 
We call a reparametrization practical if P(G | X*), 
hence P(0 | X*,Y), is known up to a normalizing 
constant. 

Once a parametrization (X* ,Q) is chosen, MCMC 
is used to obtain samples from the joint posterior 
P{X*,Q I Y). These samples can be easily trans- 
formed [using (2.1)] to yield samples from P{X,Q \ 
Y). The purpose of the reparametrization in this 
framework is solely to improve the efficiency of the 
Monte Carlo method. 

Hence, for a particular model, we have the dual 
choice of an augmentation scheme and a parametriza- 
tion. This paper is concerned with efficient choice of 
the latter, since this issue is more amenable to a 
general discussion than the former. 

Example 2.1 (Random effects model). As a sim- 
ple example, we consider the simple linear Gaussian 
random effects model, 

Yir^N{CiXi,Ey), 

Xi ^ N(L)0, Sj,), independently for i = 1, . . . , n. 



In the following we will assume that we are given where Cj are individual-level covariates, Xi are indivi- 
a set of observed data Y, unknown parameters dual-level regression parameters, centered around pop- 
with associated prior density P(G) , and an assumed ulation parameters 0, and , S^^. are covariance ma- 
model P{Y I G), which can be conveniently expressed trices. In this case X = {Xi, . . . , Y = {Yi, . . . , 
hierarchically using a hidden layer X as ^n}, and a family of practical reparametrizations is 

described by the transformations 

p{Y\e) = J p{Y\x,@)p{x\e)dfi{x), x, = x: + A,, 
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1 <i <n, for vectors A = {Ai, 1 < i < n} which are 
allowed to depend only on 9. Remarkably, even when 
A is restricted to depend only linearly on 9, this 
family of reparameterizations is sufficiently flexible 
to include complete posterior orthogonalization of 
the state space of {X,@). 

In this paper, we shall not focus on reparametriza- 
tion methods for O, which may of course be impor- 
tant for the construction of efficient MCMC meth- 
ods. 

2.1 Hierarchically Centered Parametrizations 

In this article, we will base the discussion around 
a particular form of augmentation scheme described 
by the graphical model in Figure 1. The correspond- 
ing parametrization of the model in terms of {X, 0) 
is commonly known as the centered parametrization 
(CP). This terminology was first introduced in the 
context of generalized linear hierarchical models by 
Gelfand, Sahu and Carlin [13, 14], and later in a 
more general context by Papaspiliopoulos, Roberts 
and Skold [33] as the parametrization under which 
the data Y are independent of the parameters 
conditionally on the imputed data X. The hierar- 
chical model structure in Figure 1 is common in sta- 
tistical practice and crops up in very diverse areas. 

Example 2.1 is an example of a hierarchically cen- 
tered parametrization and is one of the simplest pos- 
sible models with graphical structure given in Figure 
1. Here we give some further "canonical" examples, 
in their simplest form, in order to illustrate the di- 
versity of the areas for which the methods we present 
in this paper are relevant. 

Example 2.2 {Geo statistical model). In this case 
X = {X{u)^u £ B C R^} is an unobserved spatial 
field, assumed to be a realization from a stochastic 
process, for example, a Gaussian process with mean 
and covariance functions specified up to a few un- 
known parameters G (see, e.g., [9]). For example, 
one could take /i to be the mean level (the specifica- 
tion could easily be adapted to include covariates), 
and (T^S(a) to be the covariance function, where 
o"^ is the overall variance and S(a) is a positive- 
definite correlation matrix with parameter a which 





Y 
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Fig. 1. Graphical model of the centered parametrization 
(CP). 



controls the range of spatial dependence. In this case 
G = {fi,a,a). The data consist of conditionally in- 
dependent observations Yi ~ f{Yi \ X{ui)), where / 
is some density, at a finite number of locations, Ui, 
i=l,...,n; thus Y = {Yi,...,Yn}. 

Example 2.3 {Diffusion stochastic volatility mo- 
del). Diffusion processes are used extensively in fi- 
nancial mathematics and econometrics for model- 
ing the time-evolution of stock prices and exchange 
rates. An empirically observed characteristic of fi- 
nancial time-series data is that the instantaneous 
variance is time varying and exhibits a stochastic 
behavior. A popular class of diffusion-based models 
which can describe this behavior is the stochastic 
volatility models [5, 15], which have the following 
hierarchical structure: 

dYt = eMXt/2)dBi^t, tG[0,r], 

dXt = h{Xt)dt + QdB2,t, 

where j = 1,2, are standard Brownian motions. 
In this case is a continuous-time process, for ex- 
ample, the log-price of the exchange rate between 
two currencies. X = {Xt.,t e [0,T]} represents the 
time- varying log- variance of Yj, which is also as- 
sumed to be a diffusion process, parametrized in 
terms of h{-) (which we assume to be completely 
known to simplify things) and G, which controls 
the variation in the X process. The process Yt is 
observed at time instances tj, i = 1, . . . , n, Y = {Yt^ , 

2.2 Convergence of the Gibbs Sampler Under 
Different Parametrizations 

When used to sample from P(X*,G | y), the al- 
gorithm starts with arbitrary G^, and iterates 

j.l Update X*^ from the distribution 

of X* I G-'-i,y. 

(2.2) 

j.2 Update G-' from the distribution 

oie\X*'fY. 

for J = 1, . . . , A^. The output G^, . . . , G^ is now a 
dependent and approximate sample from P(G | Y). 
The problem that the sample is approximate is an 
important one; see, for example, [19] for a recent re- 
view and synthesis of the methods available for elim- 
inating the so-called burn-in problem. However, the 
major weakness of MCMC is that the serial depen- 
dence can be very strong within the Monte Carlo 
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P(0 I X*^,Y) 



p{e I Y) 





Fig. 2. Posterior distributions of Q given observed (dotted) 
and full augmented (solid) data for a particular value X*-' of 
imputed data X* . 



sample, rendering it unreliable for posterior infer- 
ence, since it will give estimates with very high (or 
even infinite) variance. 

A useful heuristic interpretation of the (lack of) 
performance of Gibbs sampling schemes like (2.2) 
is given in Figure 2. An efficient algorithm here is 
one where the support of P{Q \ X*^ ,Y) closely mim- 
ics that of the target P{@ \ Y) and performance is 
closely related to the relative sizes of the typical dis- 
tance traveled in a single update [i.e., the "size" of 
the support of P{Q \ X*^ ,Y)] and the distance we 
need to travel to cover the support of P(0 | Y) (i.e., 
the "size" of this support). This heuristic can be 
neatly formalized using the concept of the Bayesian 
fraction of missing information (e.g., [23, 39]), de- 
fined for a real-valued, square-integrable function of 
interest f{Q) as 

. E(Var(/(e)|x*,y)|y) 
^'•'^ = ' var(/(e)|y) • 

Liu [23] shows that the maximal correlation coeffi- 
cient, defined as 



(2.4) 



sup 7/, 
/ 



where the supremum is taken over all square-in- 
tegrable functions /, is the geometric convergence 
rate of the Gibbs sampler under the {X*,Q) para- 
metrization. He also shows (see also [1, 24]) that 



(2.5) 



7 = supCorr(/(e^),/(e^+i)) 
/ 



snpCoir{f{e),g{X*)\Y) 

f,9 



where 0*,0*+^ are consecutive output values of the 
Gibbs sampler started at stationary, ^ P(0 | Y). 



Thus, values of 7 close to 1 correspond to slowly 
mixing algorithms, resulting in high autocorrelation 
in the sampled 6 values, and are caused by high de- 
pendence between the updated components. Here 
"close to 1" should be interpreted as very close: 
an algorithm with 7 = 0.9 or even 0.99 mixes suf- 
ficiently well for most practical purposes. Hence, as 
suggested in the Introduction, we will not aim for 
7 ~ but rather try to provide the user with an 
alternative for those (surprisingly common) "very 
close to 1" situations. It is also important to real- 
ize that 7 corresponds to the worst-case scenario; it 
is the rate at which the slowest-mixing functionals 
converge to their stationary means. For geometri- 
cally ergodic algorithms, a more intuitive measure 
of efficiency is what we will call the mixing time, 
and denote by r, 

r = — ~ for 7 ~ 1. 

log7 1-7 

r is proportional to the number of iterations needed 
for the Markov chain to be within certain (total vari- 
ation, say) distance from stationarity. 

As we have mentioned, the centered is often the 
default parametrization and we will refer to the Gibbs 
sampler under CP [i.e., (2.2) with X* = X] as the 
centered algorithm (CA), and denote by 7c and Tc 
its convergence rate and mixing time, respectively. 
There are a number of reasons why a CA forms a 
natural starting point. Conveniently, the conditional 
independence structure in Figure 1 implies that the 
conditional posterior P{Q \ X,Y) = P{Q | X) is of- 
ten easy to sample from. Moreover, the analysis in 
[13, 14] showed that the CA is very efficient for loca- 
tion parameters in (generalized) linear mixed mod- 
els. 

Nevertheless, in view of (2.5) the potential draw- 
backs of hierarchical centering are easily understood. 
Note that X and Q are generally strongly depen- 
dent a priori, and to diminish this dependence the 
data Y need to be strongly informative about X; for 
a simple illustration see Figure 3. Section 3, how- 
ever, reveals important situations where the data, 
even when they are informative about Q, cannot di- 
minish the prior dependence between X and G. In 
particular, in Section 3 we demonstrate that CA: 
(1) can have a mixing time which increases as the 
sample size increases, for example, when the latent 
model P{X \ Q) fails to satisfy classical regularity 
conditions (Section 3.1); (2) can have very unstable 
behavior (e.g., be nongeometrically ergodic), due to 
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Fig. 3. Schematic figure depicting the region of highest support of prior (unshaded) and posterior (shaded) distributions, for 
the CP where Y is weakly (left) and strongly (right) informative about X . This illustrates how strongly informative data break 
down prior dependence between O and X , whereas weakly informative data do not. 



certain robustness properties of the model [in this 
case the maximal correlation coefficient is not ap- 
propriate to quantify the dependence between X 
and Q (Section 3.2)]; and (3) can be reducible due 
to the choice of the augmentation scheme, for exam- 
ple, when the observed data are finite dimensional 
but the imputed data are infinite dimensional, as in 
Example 2.3 (Section 3.3). 

One remedy to these problems is to use the Gibbs 
sampler under a reparametrization. An important — 
and as we show in this paper, complementary to 
CP — subclass of reparametrizations X* are the hi- 
erarchically noncentered parametrizations (NCPs). 
The name originates from normal linear hierarchical 
models [13], but Papaspiliopoulos [30] gave the gen- 
eral definition, that a reparametrization X* is called 
noncentered when X* and G are a priori indepen- 
dent. It is often straightforward for a given model 
to find a noncentered parametrization, as Section 
3 shows, but sometimes we need to resort to more 
involved techniques, as we discuss in Section 4. 

As an example, we return to Example 2.1, and 
note that the parametrization which takes 

Xi = Xi- DQ 

trivially satisfies the requirements of being an NCP. 
In more complex problems, identifying NCPs is typ- 
ically much harder. 

Returning to the general case, the graphical model 
corresponding to this parametrization is given in 
Figure 4. Of course, the conditions of a practical 
reparametrization are trivially satisfied since by con- 
struction P{Q I X*) = P{Q) is available up to a nor- 
malizing constant and hence we can use complex 
transformations h without having to worry about 
computing Jacobian terms. To distinguish this class 



of parametrizations we will denote them by X rather 
than X* . We will also use the notation NCA when 
referring to the Gibbs sampler under NCP, and jnc, 
to refer to its convergence rate and mixing time, re- 
spectively. 

Since X and G are a priori independent, the NCA 
can be much more efficient than the CA when X is 
relatively (to G) weakly identified by the data. In 
lieu, when X is well identified, X and G are often a 
posteriori strongly dependent by means of (2.1); an 
illustration is given in Figure 5. 

In this paper, we consider CP and NCP as gen- 
eral classes of practical parametrizations. Neither is 
"optimal" in any reasonable sense, and clearly for 
any particular problem there might be parametriza- 
tions which outperform (in terms of the efficiency 
of the corresponding Gibbs sampler) both of them. 
However, the merit of the centered/noncentered al- 
gorithms lies in that we can say very general things 
about their performance regardless of the specifici- 
ties of the particular model they are applied to. 
Their strength lies in their ease of extension to high- 
dimensional parameter spaces and complicated data 
structures. Moreover CA and NCA are complemen- 
tary, in the sense that NCAs are likely to perform 
well when the CAs do not and conversely, as we show 
in Section 3. For all these reasons, as we will show 




Fig. 4. Graphical model of NCP. 
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Fig. 5. Schematic figure of prior (unshaded) and posterior (shaded) probability regions for the NCP where Y is weakly (left) 
and strongly (right) informative about X (not X). Here strong data induce strong posterior dependence between Q and X. 
Contrast the panels with the corresponding in Figure 3. 



in the paper, in several cases we can predict their 
performance just by examining the model at hand. 

It is worth mentioning at this stage that the con- 
vergence rate analysis we have presented in this sec- 
tion does not directly apply to algorithms which 
update either of the components X* or O using 
a Metropolis-Hastings step. This updating is often 
done in complex models, where drawing directly from 
the typically multivariate distribution P{X*\@,Y) 
is not feasible. However, the intuition that the higher 
the posterior dependence between X* and Q the 
slower the convergence can be useful even for these 
more general algorithms. 

3. CASE STUDIES 

We here give a few stylized examples to illustrate 
the importance of the choice of centering/noncenterinj 
and their complementary roles. The examples are 
divided into three subcategories, corresponding to 
models that are likely to occur in practice and where 
choice of parametrization can be crucial. Bear in 
mind that the examples are stripped down to their 
most basic versions for illustrative purposes. To avoid 
losing the big picture we use simple heuristic argu- 
ments; however, all the claims can be proved rela- 
tively easily. The notation adopted below clarifies 
how NCP is constructed. 

3.1 Models where the Efficiency Depends on 
Sample Size 

An important characteristic of any MCMC algo- 
rithm is how its efficiency scales with the size of the 
data set. 

Example 3.1 {Repeated measurements). Our first 
example is in the flavor of Gelfand, Sahu and Car- 



lin [13], and Example 2.1. Consider the simple hier- 
archical model described by 

Yi ~ N(X, (Ty), independently for i = 1, . . . , n, 

X = X + Q, X~N(0,cj^), 

where P{Q) oc 1 is chosen. Here {Q,X) is the CP 
and (e,X) is the NCP. 

The fact that the model is Gaussian implies that 
the supremum in (2.4) is attained by linear func- 
tionals; thus Tc = 0(1/ log This is similar to the 
asymptotic argument Gelfand, Sahu and Carlin [13] 
used to support the use of the CA for linear models. 
On the other hand, r^c = 0{n). Hence, the CA im- 
proves while the NCA deteriorates with sample size 
for this simple model. 

Example 3.2 {Hidden Markov model). Consider 
the following trivial hidden Markov model (HMM): 

Yi ~ N(Xj, cjy), independently for z = 1, . . . , n, 

Xi = Xi + e, 

where Xi follows Markov dynamics with station- 
ary distribution N(0, cr^), for instance through Xi = 
pXi^i + (1 — (P'^l'^Ox'Zi for some appropriate stan- 
dard normal innovation Zi and constant \p\ < 1. 
Also assume that the proper prior P(0) = N(0, 1) 
is chosen. This time both Tc and Tnc are 0(1). 

In the case where Xi are chosen to be indepen- 
dent, a complete analytic solution is available for the 
Gibbs sampler (from [37]). For instance, the pres- 
ence of the data to "tie down" the hidden states is 
crucial for the performance of CA. In the absence 
of data, that is, when we want to sample from the 
prior, Tnc = 0{1) (trivially since NCA delivers i.i.d. 
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samples from the prior), but now Tc = 0{n) and per- 
formance of CA will deteriorate with n. We note 
that the assumption of conditional independence in 
the hidden states is not crucial (see Example 3.3 
though). Both algorithms will be 0(1) if the X pro- 
cess is modeled to have moderate serial dependence 
(the NCP needs to be defined appropriately in this 
case); see, for example, [33]. 

The assumption of Gaussian errors in the obser- 
vation and hidden equation is a very important one. 
The remaining examples in this section, and those 
in Section 3.2, elaborate on these non-Gaussian ex- 
tensions. 

Example 3.3 {Nonregular parameters). The fol- 
lowing example illustrates how CA can deteriorate 
with n even if the number of random effects and data 
are equal (n). Nonregular in this example refers to 
the problem of estimating G given X. Assume the 
hierarchical model 

Yi ~ N(Xj, 1), independently for i = 1, . . . , n, 

Xi = <dXi, Xi ~ Uniform(0, 1), 

where P(0) oc 1 is chosen. What makes this model 
particular is that Var(0 | X) = 0{n"'^) while Var(G | 
Y) = 0{n~^) and hence Tc > 0{n). On the other 
hand, it can be shown that r„c = 0(1). 

A similar problem with CA can occur in a variety 
of situations where the latent model P{X\Q) fails 
to satisfy classical regularity conditions, including, 
for example, latent autoregressions Xi = GXj_i + £i 
for Gaussian innovations in the explosive case G > 1 
[45] and Exponential innovations in general [28]. 

In the above example CA ran into problems due 
to the changing support of P{X \ G) with G. Below, 
we give two examples where instead the support of 
P{X I G, Y) changes with Y and the relative perfor- 
mances of CA and NCA are reversed. 

Example 3.4 [Stochastic frontier models). This 
is a class of models very popular in economics (see, 
e.g., [17]), which in their simplest form are written 

as 

Yi — Xi 1^2 5 

Ui ~ Exp(A), independently for i = 1, . . . , n, 

Xi=Q + Xi, X,~N(0,(t2), 

where the Uj's represent company-specific inefficien- 
cies, and we again choose P(G) oc 1. What makes 



this model challenging from a computational per- 
spective is the presence of the asymmetric error in 
the observation equation. This does not cause prob- 
lems to CA, since it is easy to see that Tc = 0(1). On 
the contrary, since G | X, y ~ Exp(nA) truncated on 
G > maxi{yi - Xi}, Var(G \X,Y) = (nA)-^; thus 
Tnc ^ 0{n). Similar models are also used for reaction- 
time data; see, for example, [34]. 

Example 3.5 {Rounded data). Here we consider 
models for truncated data. Assume for example that 

Yi = [Xil , independently for i = 1, . . . , n, 

Xi = Xi + Q, X,~N(0,a2), 

where we use the notation [x\ to denote the largest 
integer less than or equal to x. 

Under CP the augmented and the observed in- 
formation are of the same 0{n); thus Tc = 0(1). 
On the other hand, P(G | X,Y) is supported in 
[maxj{l^ — Xi},mmi{Yi -|- 1 — Xi}]; hence Var(G | 
X,y) = 0(?i-2) and Tnc = 0{n). 

Example 3.6 {Bayesian classification). Consider 
the problem of Bayesian classification as in, for ex- 
ample, Lavine and West [22]. Assume we are in- 
terested in whether a set of covariates Y can be 
used to distinguish between two classes of individu- 
als (e.g., healthy /sick). Based on a training sample 
with known classes, we have derived the posterior 
predictive distributions /o and /i of the covariates 
based on the respective classes. The same covariates 
are now measured on a number n of individuals with 
unknown status. Hence, we have independent data 
Yi,...,Yn from the two-component mixture model 
with density efo{Yi) + (1 - Q)fi{Yi), where G is 
the proportion of individuals of class to which 
we assign a uniform prior. For computational conve- 
nience, this model is usually augmented with indi- 
cators Xi and rewritten in hierarchical form as (see, 
e.g., [8]) 

Yi ~ fxi (Yi), independently for i = 1, . . . , n, 
Xi = l{Xi < G}, Xi ~ Uniform(0, 1). 
Here we might consider two extreme scenarios (with 

1. Perfect classification. If /o and /i have disjoint 
supports, we can perfectly distinguish the classes 
and P(G \X,Y) = P(G | Y). Hence, the CA wiU 
produce independent replications from the poste- 
rior. For the NCA, however, G | X,Y ^ 
Uniform(X(-7v), X(7v+i)). Hence Var(G | X,Y) = 
0{n~'^) and r„c > 0(n) since Var(G | Y) = 0(n^^). 
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2. Completely unrelated covariates. Now assume the 
covariates turn out not to be related to class, that 
is, /o = /i. Here P{@ \X,Y) = P{e) and the 
NCA will produce independent replications. On 
the other hand, 9 | X ~ Beta(A^ + 1, (n - A^) + 
1) and Var(e | X) = 0{n-^) while Var(G | Y) = 
0(1). Hence, Tc>0{n). 

Of course realistic examples often tend to fall in be- 
tween scenarios 1 and 2 above. Nevertheless, they 
convey some intuition about the behavior in their 
neighborhoods. 

3.2 Models where the EfFiciency Depends on the 
Tails of the Links 

Section 2.1 related the convergence rate of the 
two-component Gibbs sampler to the maximal cor- 
relation coefficient 7 between the updated compo- 
nents. Nevertheless, 7 is a measure of global depen- 
dence; it is thus inappropriate when the dependence 
between X and G is very different in different parts 
of the space. In these situations, CA can have quali- 
tatively very different behavior than NCA. The fol- 
lowing characteristic example is taken from [32]. 

Example 3.7 {Heavy-tailed HMM). Consider the 
following "innocent" -looking modification of the 
HMM given in Example 3.2: 

Yi ~ Cauchy(Xi, CTy), 

independently for i = 1, . . . , n, 

Xi = Xi + Q, X,~N(0,c72), 

where the improper prior P(0) oc 1 is chosen. There- 
fore, we have robustified the observation equation 
by replacing the Gaussian error by a Cauchy with 
median Xi and scale parameter ay. Figure 6 shows 
the contours of the posterior distribution of (A, O), 
for n = 1, Y = 0, ay = 1, a'^ = 5. Near the mode, 
X and Q are approximately independent: however, 
as — > 00, the conditional posterior distribution of 
X I O, y tends to be concentrated around G and ig- 
nores the data Y. The reason for this is that when 
|G — y I is large, the two sources of information about 
X , the prior and the likelihood, are confiicting with 
each other, and inference for X is dominated by the 
link with the lighter tails. In fact, Papaspiliopoulos 
and Roberts [32] show that CA is not even geomet- 
rically ergodic (therefore jc = l)j whereas NCA is 
geometrically (in fact it is uniformly) ergodic. Note 
that in the complementary model 

Yi ~ N(Aj, (Ty), independently for i = 1, . . . , n, 

Xi = X^ + G, Ai~ Cauchy (0,cj^), 



the performance of CA and NCA is reversed, and 
CA is uniformly ergodic while NCA is subgeometric. 

Papaspiliopoulos and Roberts [32] derive very gen- 
eral results for hierarchical models with linear struc- 
ture, including time-series and spatial models where 
A is a latent Gaussian field which is observed at cer- 
tain locations with heavy-tailed noise. Models with 
heavy-tailed observation error are used for protect- 
ing the inference about X from outlying observa- 
tions; see, for example, [3, 20, 43]. 

3.3 Models where the EfFiciency Depends on the 
Amount of Imputation 

In modern complex hierarchical modeling, it is 
common to impute infinite-dimensional objects X, 
or alternatively very fine finite-dimensional approxi- 
mations of them. The canonical example in this case 
is the diffusion stochastic volatility model, presented 
in Example 2.3 and revisited in Example 3.8 be- 
low, but this type of augmentation scheme is under- 
taken also in Bayesian nonparametric modeling, as 
we show in Example 3.9 below. Often in such sit- 
uations ergodicity constraints link X and 0, in the 
sense that G (or some components of it, if it is a vec- 
tor) might be completely specified given X. Then, 
the convergence of CA deteriorates as the approxi- 
mation gets finer (it is reducible in the limit). One 
approach to circumvent this problem is to jointly up- 
date G and X, advocated, for example, in Darren 
Wilkinson's discussion to [33] and in [21]. However, 
joint updates are often complicated to implement, 
and might, for example, necessitate reversible-jump 
algorithms that are difficult to tune and diagnose. 
Here we argue that NCA often provides a simpler 
and efficient alternative in such situations. The ex- 
amples are natural opposites to Example 3.1 in the 
sense that dimension of augmented rather than ob- 
served data increases with n. 

Example 3.8 {Dijfusion stochastic volatility model, 
revisited). Here we revisit the model presented in 
Example 2.3. Assume for illustration that one obser- 
vation from the price process is available, Y = Yt^ , 
thus y I A ~ N(0,/o' ds). However, the distri- 
bution of the path integral is not known; thus in 
order to compute the likelihood one needs to im- 
pute the whole path X = {As,0 < s < ii}, or as 
it is usually done, a very fine discretization Xi-n = 
{Xn^l^^i = 0, . . . ,n}, for large n, and approximate 
the integral appropriately. However, the quadratic 
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variation identity 



lim VfX 

i=l 



iti/n 



(i-l)tl/ 



shows that Q is completely determined by any string 
of sample path X {n = oo), whereas for large but fi- 
nite n, Xi-n and G are very strongly dependent a 
priori. The limiting centered algorithm which im- 
putes X is actually reducible: for any starting B'', 
P{X \Y,Q^) is a distribution on the path space con- 
centrated on paths for which their quadratic varia- 
tion is (6")^; and conditionally on such X^ '-^ P{X \ 
y,e°), P{Q I Y,X^) is concentrated on G°. Roberts 
and Stramer [38] show that the CA on (Xi-n,@) is 
0(n); that is, the better we try to approximate the 
likelihood [and therefore the posterior P(G | Y)], the 
worse the convergence of the algorithm. 

A possible choice for NCP is to take X = {i?2,s5 < 
s < ti}, the driving Brownian motion of the volatil- 
ity equation, and the associated approximation Xi-n = 
{^2,iti/n) ^ = 0, . . . , re}; this approach has recently 
been adopted in [5]. NCA is 0(1), thus it does not 
deteriorate as the approximation to the true con- 
tinuous-time model gets better. Roberts and Stramer 



[38] suggested an alternative reparametrization X* , 
which qualifies to be called noncentered only when 
6 = (a case, however, which has motivated the con- 
struction). If we define X^ = Xt/Q, then direct ap- 
plication of Ito's formula shows that 

b{ex:) 



dx; 



e 



■ dt + dB2 i 



Clearly, when 6 = 0, X* is an NCP and in fact coin- 
cides with the one suggested earlier. When 6/0, the 
distribution of X* does depend on 0; however, any 
realization of X* contains only finite information 
about 0. (Diffusion sample paths on a fixed time 
interval, say [0,ti], under mild regularity conditions 
contain finite information about any parameters in 
the drift, but infinite information about parameters 
in the diffusion coefficient.) Therefore, the Gibbs al- 
gorithm which updates G and a fine approximation 
Xl,^ = i = 0, . . . , re} win also be 0(1). In the 

context of diffusion processes, this parametrization 
is particularly appealing, as it will be demonstrated 
in Section 5. 

Example 3.9 {Bayesian nonparametrics) . Avery 
active area of research is that of Bayesian nonpara- 




FlG. 6. Contour plot of the posterior distribution of {X, O) from the heavy-tailed HMM, with n—1, Y = 0, ay — 1, 
Notice that near the mode, X and O are roughly independent, whereas they become highly dependent in the tails. 
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metric statistics. In this area, unknown distributions 
and functions are modeled using flexible stochastic 
process priors. For example, a popular choice is the 
Dirichlet process prior [11], which is a prior on the 
space of all distribution functions on a given space, 
say Z. When we write that P ^ DP{Q,H), we im- 
ply that P is a random measure distributed accord- 
ing to the Dirichlet process prior with parameters 
> 0, and H which is a (known) distribution on Z. 
Many methods for carrying out inference on models 
containing hidden Dirichlet components work by in- 
tegrating out the Dirichlet process. In what follows 
we will deliberately not adopt this approach, with a 
view toward generalizations to other nonparametric 
models where integrating out is not an option. 
A constructive series representation [42] gives that 

oo 

(3.1) P(.) = Y^p.,5z^(.), 

i=l 

where Zi are i.i.d. variables from H, and 
pi = Xi and 

p^ = {l- Xi)(l - X2) ■ • ■ (1 - Xi_i)X,, for i > 2, 

where the Xj's are independent Beta(l, Q) variables. 
Therefore, every realization of P is an infinite mix- 
ture of random point masses pi, located at random 
locations Zi £ Z. Notice that X and G are linked 
through the ergodicity constraint 




P is commonly used at an intermediate stage in hi- 
erarchical models. A popular class of such models, 
used for classification, density estimation and non- 
parametric regression, is the Dirichlet mixture mod- 
els; see, for example, [10, 26]. When X = {Xi,X2, ■ ■ ■} 
and Z = {Zi, Z2, ■ ■ ■} are imputed in order to facili- 
tate inference for the hierarchical model, as for ex- 
ample in [31], the CA is reducible since X and @ are 
linked by the constraint given above. Approximate 
centered methods, which impute only an approxima- 
tion Xi-n = {Xi, . . . ,Xn}, Zi-n = {Zi, Z„,}, aS 
in [18], will have 0{n) mixing time. Papaspiliopou- 
los and Roberts [31] suggest a simple NCP to come 
round this problem, given by the inverse-CDF con- 
struction, Xi = 1 — (Xj)^/® for i.i.d. uniforms Xi ^ 
Uniform(0, 1). 

Similar computational problems appear in a vari- 
ety of other nonparametric models, where, for exam- 
ple. Levy processes are used at intermediate stages 
in hierarchical models. For a description of such mod- 
els, see, for example, [44]. 



4. CONSTRUCTING NCR'S 

We have already promoted the centered / noncentered 
methodology as a default strategy for tackling chal- 
lenging computational problems, and we have given 
insight on when each should be used (or avoided) 
through the case studies in Section 3. Typically, CP 
is naturally suggested by the augmentation scheme 
chosen for a particular hierarchical model (in fact 
usually they coincide). On the other hand, in the 
examples we have presented so far NCP has been 
quite obvious (maybe with the exception of the dif- 
fusion model in Example 3.8). Some tricks we have 
applied were: 

• Location: if X~F(- -6), 

then h{X, G) = X + 9, X ~ F(-) (e.g.. Examples 

3.2, 3.7). 

• Scale: iiX^F{-/Q), 

then h{X,e) = QX, X ~ F(-) (e.g.. Examples 

3.3, 3.8). 

• Inverse CDF: if X ~ F@{-) where Fq is a distri- 
bution function on R, 

then h{X, 9) = Fq^{X), X ~ Uniform(0, 1) (Ex- 
amples 3.6, 3.9). 

Clearly, the above "recipes" can be combined in dif- 
ferent ways. For example, suppose that X is a latent 
Gaussian field as in Example 2.2. An NCP that has 
been successfully applied in this context (see, e.g., 
[25] in the context of log-Gaussian Cox processes 
and [7] for spatial generalized linear mixed models) 
is 

(4.1) X = h{X,Q) = aJ:{a)^/^X + n, 

for a vector of i.i.d. standard normals X and where 
5](a)^/^ is, for example, provided by a Cholesky fac- 
torization of T,{a) . Similarly, if Xi , . . . , Xt is a Markov 
chain with transition CDF Fq{- \ •), the inverse CDF 
can be applied recursively to construct an NCP 
through Xi = FQ^{Xi \ Xi^i) for a sequence of in- 
dependent uniforms Xi, . . . , Xn- 

It is not easy to give very general instructions on 
how to produce an NCP for arbitrary latent struc- 
tures. In a way, this is the price to pay for having a 
method which in principle could be applied in any 
context. However, several probabilistic relationships 
can be exploited; see, for example, [30]. Generally, if 
you know how to simulate from the prior P{X \ 9), 
a careful inspection of the algorithm often naturally 
suggests an NCP. Indeed, the tricks displayed above 
are all standard tools in simulation. Noncentering 
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of stochastic processes can often be achieved by ref- 
erence to various representation theorems. For ex- 
ample, we followed this approach in Example 3.9 
using a constructive representation of the Dirichlet 
process. 

An important observation, however, which can be 
used to construct NCP for complex latent struc- 
tures is that X can take values on a much higher- 
dimensional space than X and the function h in 
(2.1) can be noninvertible, as we discuss in the se- 
quel. 

4.1 State-Space Expansion 

Notice that the Gibbs sampler which samples from 
the posterior P{X, Q \ Y) can be implemented in the 
following manner. Assume that the current state is 

j.l Update X according to the distribution of 
X I eJ-\y, to give X^'^V2_ 

j.2 Transform {X^~^l'^ ,Qj~^) ^ X^"^'"^ . 

j.3 Update G-' according to the distribution of 

@\x^~'^\y. 

jA Transform {X^~^^'^,e^) X^ . 

Here step jA is unnecessary if it is possible to sam- 
ple directly from the distribution in jA. We have 
included steps j.1-2 and jA, rather than a single 

step updating X'' from X \ @^~^,Y, for two reasons: 
first, it illustrates how code from a scheme like (2.2) 
could be reused, and second, it allows for situations 
where X \ Q^~^,Y is not easily updated. 

While step j.2 in (4.2) above will often be im- 
plemented through a deterministic transformation 
X = h~^{X, Q), equality in distribution is sufficient. 
Hence, the step can be performed by a draw from X \ 
X^ , 0-^"^, which allows for situations where /i is a 
many-to-one function. In Example 3.6, for instance, 
where X = 1{X < G}, in step j.2 we draw X from 
a Uniform(0, 1) if X = and from a Uniform(0, G) 
distribution if X = 1. 

Another idea, suggested and elaborated in [30], is 
that step j.3 in (4.2) can sometimes be performed 
using only partial information about X , and hence 
the full X need not be computed in step j.2. This 
is the case when j.3 is performed through an ac- 
cept/reject mechanism like the Metropolis-Hastings 
algorithm and h{X,-) only needs to be computed 
for the current value G-'"^ of Q and a proposed 



new value G'. Specifically, X can live on an infinite- 
dimensional space (while the dimension of X can 
still be finite) and this flexibility makes noncenter- 
ing feasible to apply in very general contexts; see the 
example below for an illustration and [36] for a fully 
worked example. The method described in Example 
4.1 below has recently been found necessary in or- 
der to construct MCMC algorithms for inference for 
discretely observed diffusion processes in [2]. 

Example 4.1 (Latent Poisson processes). Pois- 
son processes are natural a priori models for event 
times and spatial locations of objects. Often data are 
partial or consist of a noisy image from which spatial 
locations (X) and intensity (G) are to be inferred. 
Note that in latent point process models, X | G,y 
can rarely be updated exactly and slow reversible- 
jump algorithms are commonly applied. Hence, it 
is especially important here that the MCMC algo- 
rithm can move freely along the G-axis. 

We will discuss two possible ways to construct an 
NCP X for a Poisson process X with rate G on 
the unit interval. They both involve state-space ex- 
panding techniques (see, e.g., [30, 33]) where X is 
formally an infinite-dimensional object. The ideas 
carry over in a straightforward manner to Poisson 
processes on arbitrary spaces with general intensity 
functions parametrized in terms of a vector of pa- 
rameters G: 

1. X is a unit rate Po-process on [0,1] x [0,oo). 
If {{ui,vi), . . .} is an enumeration of the point- 
coordinates of X, we define X = h{X, G) = {uj; 
^'i ^ G} (see the left frame in Figure 7). 

2. X is a unit rate Po-process on [0, oo) with points 
{ui,...} and X = h{X,e) = {ui/Q;ui<Q}. 

Again, what makes the above strategies feasible is 
that for a fixed G, X and hence L{h{X ,Q),Q) only 
depends on a finite subset of the points in X. Hence, 
if G is updated using a Metropolis-Hastings algo- 
rithm, we only need to store partial information 
about X in each step. More implementation details 
with an application to shot-noise type stochastic 
volatility processes can be found in [36]. 

Example 4.2 (Hidden Markov jump processes). 
Given the above recipes for constructing NCPs for 
Poisson processes, the step to Markov jump pro- 
cesses is not far. Such models are natural for de- 
scribing the variation of populations and epidemics 
in time; see, for example, [4, 27, 29, 35] for a variety 
of applications of hidden Markov jump processes. 
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Fig. 7. Two different NCPs of a 0-rate Poisson process on the unit interval. In the left figure X is a unit rate Po-process 
on [0,1] X [0,cxd) and in the right on [0,oo). Circles denote the points of X and crosses the points of X obtained after the 
many-to-one transformation X = h{X,0). 



For example, we might consider an immigration- 
death process X{t), t E [0,T], defined through 

P{X{t + h) = k\ X{t)=l) 

= Xh + o{h), k = l + l, 

= 1- {X + ln)h + o{h), 1 = 1, 

= Ifih -\- o{h), k = l — l, 

= o{h), otherwise, 

for i = 0, 1, . . . and with Q = (A, fi) for immigration 
and death rates A and /i, respectively. 

Processes like X{t) are often simulated recursively 
based on a sequence of independent exponential and 
uniform variables, corresponding to interarrival times 
and type of event, respectively. Hence, we define 
Xi = {zi,Ui), i = l, . . . , with Zi and Ui being Exp(l) 
and Uniform(0, 1), respectively. Further, denoting 
by Tj the time of the ith. event, the transformation 
X = h{X , Q) is defined recursively through 

Ti = Ti-i + Zi/ (A + X('rj_i)^), 

X{t) = X(r,_i) + 21{u, < A/(A + Xin_i)fi)} - 1, 

t e [Ti^i,Ti), 

for To = 0, 1 = 1, 

5. ROBUSTNESS AND DATA-BASED 
MODIFICATIONS 

When implementing "automatic" MCMC routines 
in software one is often faced with the problem that, 
under the same model, performance can vary be- 
tween different data sets (see, e.g., [6]). Hence, for 
example, CP might be preferred by some data sets 
and NCP by others, depending on how informative 
the particular realization of the data is for X. An- 
other problem is that the relative performance of 



the two strategies might vary within the support 
of P{X,Q I Y). We have already encountered this 
problem in Example 3.7 (see especially Figure 6); 
here a CP is preferable around the mode while it is 
necessary to use an NCP in the tails. 

Fortunately, it is not necessary to make a choice 
between a CA and an NCA: alternating the two 
within the same algorithm might greatly improve 
the robustness without causing increase in compu- 
tational time of practical relevance. In fact, such a 
combined strategy has been shown to be necessary 
in a number of practical situations (e.g., [16, 27, 32]). 

5.1 Correcting for the Presence of Data 

Both CPs and NCPs are constructed exclusively 
based on the prior distributions of the model. Since 
this prior distribution is usually manageable (in com- 
parison with the posterior), we have seen how both 
parametrizations can be applied through a wide class 
of statistical models and data structures. Inevitably, 
the a priori construction can also be viewed as a 
weakness and one might argue that when efficiently 
parametrizing the posterior, data should be taken 
into account. 

In this section we will discuss models where sim- 
ple data-based modifications of CPs and NCPs ei- 
ther are necessary or simply improve performance. 
We show that even when CPs and NCPs are un- 
successful, they often form the first step toward an 
efficient strategy. 

5.1.1 Correcting the CP. Problems relating to the 
CP are fairly easily understood and it is natural to 
look for reparametrizations in terms of conditionally 
pivotal quantities X* . 

Consider, for example, a linear reparametrization 

(5.1) X = v^^^{@,Y)X* + m{e,Y), 
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where the hnearity in X* ensures that P(0 | X*) 
is available. If X belongs to a location-scale fam- 
ily, an NCP is given by v{Q,Y) = \ai[X \ Q) and 
m(0,y) = E(X I 0). For more general families this 
can be seen as a first-order approximation of an 
NCP. When correcting this parametrization for the 
presence of data, it is natural to consider v{Q,Y) = 
Var(X I e,y) and m{Q,Y) = E(X | e,Y). Indeed, 
when also X | 0,1" is location-scale, this choice will 
make X* independent of 0. The approach allows 
data to decide the parametrization and naturally 
gives an NCP for "infinitely weak data" and a CP 
for "infinitely strong data." Hence, (5.1) can be in- 
terpreted as a partially noncentered parametrization 
(see [30, 33]). 

While these choices of v and m are not analytically 
available in general, they can often be sufficiently 
well approximated from, for example, a quadratic 
expansion of the log-posterior; see [6] for a successful 
application in the context of spatial GLMMs. 

5.1.2 Correcting the NCP. Of course, we can also 
consider reparametrizations of NCPs, that is, 

(5.2) X = h{X,e) = h{h{X*,e,Y),@). 

While in (5.1) it was natural to search for an ap- 
proximate pivotal quantity X* , the function h in 
(5.2) often fills the purpose of relieving hard con- 
straints on X imposed by data. The situation arises 
naturally in models for discretely observed stochas- 
tic processes, where the unobserved process values 
are augmented to facilitate likelihood calculations or 
prediction. 

Example 5.1 (Discretely observed diffusions). 
This is close in spirit to Examples 2.3 and 3.8. We 
again assume that Xt is the solution to the stochas- 
tic differential equation 

dXt = h{Xt)dt + QdXt; 

however, instead of X being unobserved, here it is 
discretely observed. Thus, assuming for simplicity 
one observation, the data are Y = Xt^, for some 
< ti. The likelihood function, P{Xt^ | 0), is typi- 
cally unavailable when h[x) is a nonlinear function 
of X. Instead, it is easy (using the so-called Gir- 
sanov formula) to write down the likelihood given 
the complete path X = {Xs , s < ii } , or at least to 
approximate it accurately enough using a fine dis- 
cretization Xi-n = {Xitx/ni i = 0, ■ ■ ■ , n}, for large n. 
It can be recognized that the setting is very sim- 
ilar to that in Example 3.8, and we have already 



seen that a CA which imputes Xi-n has 0{n) mixing 
time, due to the quadratic variation identity bind- 
ing with Xi-n, and should be avoided. Neverthe- 
less, the NCA which was successful in Example 3.8 
does not work here, since by construction, y is a 
deterministic function of and X . Therefore it is 
impossible to update given X without violating 
the constraint imposed by the data. 

The natural sequence of reparametrizations for 
this problem is given by 

x{t) = ex{t) = h{x*,e,Y) = Q{x*{t)-tY/e); 

see [38] for more details. 

Example 5.2 (Caussian Markov random fields). 
In many applications, observed data can be repre- 
sented Gaussian field sampled at irregularly 
spaced locations. A well-known problem with this 
approach is that likelihood computations can be bur- 
densome, especially for large data sets, due to the 
necessary inversion of the covariance matrix. On the 
other hand, if the covariance matrix is sparse, com- 
putations can be speeded up considerably. One ap- 
proach (due to Rue [40] ) to allow for fast estimation 
is to approximate the Gaussian field by a Markov 
random field, by embedding the points into a rect- 
angular lattice equipped with a (suitable) Markov 
random field neighborhood structure. As a result, 
the covariance matrix E of the approximate model 
is sparse. The unavailable observations at the empty 
lattice points are treated as missing data and are 
imputed. For notational simplicity, we assume that 
only the location parameter is unknown. More- 
over, we set X = (Yi, . . . ,Yn, X[, . . . , X'^)'^ to be the 
values of the field at the lattice point, arranged such 
that the first n values are actually observed, and the 
remaining m are unobserved. Therefore, under the 
approximate model, X | ~ N(0,S), (X, 0) is the 
CP, and X is a partially observed process. Given the 
complete data X, and assuming a fiat prior for 0, 
we have that 

P(0 I X) oc exp(-(A: - 01)^E-^(A: - 01)/2), 

where 1 = (1, . . . , 1)"^. 

Often a large number of lattice points needs to be 
augmented (?i <C m) (see, e.g., [41]), hence the aug- 
mented data can contain vastly more information 
about than observed data Y = (yi,...,y„) and 
the CA will perform poorly. On the other hand, the 
corresponding NCA defined through X = X — 01 ^ 
N(0, S) will in fact be reducible since Q = Yi — X^, 
i = 1, . . . ,n. 
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Here it is natural to choose X* = {X^, . . . , X!^) to 
be zero mean with the appropriate covariance struc- 
ture and 

X = X + m = h{X*,Q,Y) 

= (yi-e,...,y„-G,x*,...,x;;f + G1. 

Hence, in effect we center the observed and noncen- 
ter the unobserved values. 

6. DISCUSSION 

We have tried to demonstrate in this paper the 
very wide applicability and complementary roles of 
centering and noncentering approaches to parame- 
trization of hierarchical models. Neither centering 
nor noncentering methods are uniformly effective; 
however, we give a clear qualitative (and in some 
cases quantitative) understanding of when noncen- 
tering parametrizations are likely to be effective. 
Conveniently, the two strategies possess complemen- 
tary strengths, making their combined use an attrac- 
tive and relatively robust option. 

A significant advantage of centered over noncen- 
tered methods is that they explicitly use the con- 
ditional independence structure inherent from the 
hierarchical setup. It is common for conjugate con- 
ditional distributions for parameters to exist only 
when using the CP. Thus, in some cases this can lead 
to a significant relative computational cost for the 
NCP. Our experience suggests that in many cases, 
any extra computational burden is more than offset 
by the convergence advantages resulting from the 
appropriate use of noncentering. See [27] for a sim- 
ulation study of algorithm efficiency in the partially 
observed stochastic epidemic context. This work ex- 
plicitly considers the loss of efficiency due to the 
breakdown of conjugacy and compares this with the 
benefits of more rapid convergence. 

Noncentered parametrizations are certainly not 
unique. We describe how quite general classes of 
models allow the easy construction of noncentered 
parametrizations. Perhaps the most generally appli- 
cable technique for this purpose is the state-space 
expansion technique. Here a key feature of noncen- 
tered parametrizations is that they are by construc- 
tion always practical, that is, P{Q\X) = P{Q) is al- 
ways available up to a normalizing constant. 

It is by now well established empirically and in 
some cases theoretically, that parametrizations which 
are superior to both the centered and noncentered 



strategies can usually be constructed, either by con- 
structing a suitable continuum of parametrizations 
between centered and noncentered, or perhaps by 
more data-specific methods such as the construction 
used for diffusion processes in Example 5.1. Partially 
noncentered parametrizations are sometimes diffi- 
cult to construct, though the methods have appeal- 
ing robustness characteristics with respect to differ- 
ent data sets and data types. 

Further work is required in a number of directions. 
Applications of (partial) noncentering parametriza- 
tions are as yet uncommon, and much remains to be 
learned about the effectiveness of the method, par- 
ticularly taking into account the possible additional 
computing overheads sometimes associated with the 
use of these methods. Also further study is required 
to generalize our theoretical knowledge beyond the 
examples we are able to cover here. 
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